AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Neural Information Processing SystemsFeb-10-2026, 06:25:00 GMT

9d99197e2ebf03fc388d09f1e94af89b-Paper.pdf

backdoor attack, latent space, proceedings, (13 more...)

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
(20 more...)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Neural Information Processing SystemsDec-25-2025, 14:57:03 GMT

Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork

Deep neural networks (DNNs) are vulnerable to backdoor attacks. Previous works have shown it extremely challenging to unlearn the undesired backdoor behavior from the network, since the entire network can be affected by the backdoor samples. In this paper, we propose a brand-new backdoor defense strategy, which makes it much easier to remove the harmful influence of backdoor samples from the model. Our defense strategy, \emph{Trap and Replace}, consists of two stages.

backdoor attack, classification head, trap and replace, (8 more...)

Industry: Information Technology > Security & Privacy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.58)

arXiv.org Artificial IntelligenceDec-1-2025

Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

Liu, Xinyu, Zhang, Xu, Chen, Can, Wang, Ren

Understanding how backdoor data influences neural network training dynamics remains a complex and underexplored challenge. In this paper, we present a rigorous analysis of the impact of backdoor data on the learning process, with a particular focus on the distinct behaviors between the target class and other clean classes. Leveraging the Information Bottleneck (IB) principle connected with clustering of internal representation, We find that backdoor attacks create unique mutual information (MI) signatures, which evolve across training phases and differ based on the attack mechanism. Our analysis uncovers a surprising trade-off: visually conspicuous attacks like BadNets can achieve high stealthiness from an information-theoretic perspective, integrating more seamlessly into the model than many visually imperceptible attacks. Building on these insights, we propose a novel, dynamics-based stealthiness metric that quantifies an attack's integration at the model level. We validate our findings and the proposed metric across multiple datasets and diverse attack types, offering a new dimension for understanding and evaluating backdoor threats. Our code is available in: https://github.com/XinyuLiu71/Information_Bottleneck_Backdoor.git.

artificial intelligence, backdoor attack, machine learning, (18 more...)

2511.21923

Country: North America > United States > Michigan (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

arXiv.org Artificial IntelligenceNov-13-2025

Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

Chen, Jinwen, Zhang, Hainan, Sun, Fei, Zhang, Qinnan, Wen, Sijia, Wang, Ziwei, Zheng, Zhiming

Stealthy data poisoning during fine-tuning can backdoor large language models (LLMs), threatening downstream safety. Existing detectors either use classifier-style probability signals--ill-suited to generation--or rely on rewriting, which can degrade quality and even introduce new triggers. We address the practical need to efficiently remove poisoned examples before or during fine-tuning. We observe a robust signal in the response space: after applying TF-IDF to model responses, poisoned examples form compact clusters (driven by consistent malicious outputs), while clean examples remain dispersed. We leverage this with RFTC--Reference-Filtration + TF-IDF Clustering. RFTC first compares each example's response with that of a reference model and flags those with large deviations as suspicious; it then performs TF-IDF clustering on the suspicious set and identifies true poisoned examples using intra-class distance. On two machine translation datasets and one QA dataset, RFTC outperforms prior detectors in both detection accuracy and the downstream performance of the fine-tuned models. Ablations with different reference models further validate the effectiveness and robustness of Reference-Filtration.

computational linguistic, large language model, machine learning, (19 more...)

2505.23015

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceSep-25-2025

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Cheng, Yize, Wang, Wenxiao, Moayeri, Mazda, Feizi, Soheil

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.

backdoor, large language model, machine learning, (13 more...)

2505.23001

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (0.89)
Education (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

arXiv.org Artificial IntelligenceSep-9-2025

Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models

Peng, Zuquan, Fu, Jianming, Zou, Lixin, Zheng, Li, Ren, Yanzhen, Peng, Guojun

The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned models, extra clean samples, or significant computational resources to detect backdoor samples, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatio\textbf{N} discr\textbf{E}pancy consis\textbf{T}ency \textbf{E}valuation (\NETE). This is a novel detection method that can be used both pre-training and post-training phases. In the detection process, it only requires an off-the-shelf pre-trained model to compute the log probability of samples and an automated function based on a mask-filling strategy to generate perturbations. Our method is based on the interesting phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between different perturbed samples and input samples, thereby evaluating the consistency of the perturbation discrepancy to determine whether the input sample is a backdoor sample. Experiments conducted on four typical backdoor attacks and five types of large language model backdoor attacks demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.

large language model, machine learning, natural language, (16 more...)

doi: 10.1016/j.neunet.2025.108025

2509.05318

Country: Asia > China (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)

arXiv.org Artificial IntelligenceAug-22-2025

BadFU: Backdoor Federated Learning through Adversarial Machine Unlearning

Lu, Bingguang, Hu, Hongsheng, Miao, Yuantian, Sohail, Shaleeza, He, Chaoxiang, Wang, Shuo, Chen, Xiao

Federated learning (FL) has been widely adopted as a decentralized training paradigm that enables multiple clients to collaboratively learn a shared model without exposing their local data. As concerns over data privacy and regulatory compliance grow, machine unlearning, which aims to remove the influence of specific data from trained models, has become increasingly important in the federated setting to meet legal, ethical, or user-driven demands. However, integrating unlearning into FL introduces new challenges and raises largely unexplored security risks. In particular, adversaries may exploit the unlearning process to compromise the integrity of the global model. In this paper, we present the first backdoor attack in the context of federated unlearning, demonstrating that an adversary can inject backdoors into the global model through seemingly legitimate unlearning requests. Specifically, we propose BadFU, an attack strategy where a malicious client uses both backdoor and camouflage samples to train the global model normally during the federated training process. Once the client requests unlearning of the camouflage samples, the global model transitions into a backdoored state. Extensive experiments under various FL frameworks and unlearning strategies validate the effectiveness of BadFU, revealing a critical vulnerability in current federated unlearning practices and underscoring the urgent need for more secure and robust federated unlearning mechanisms.

artificial intelligence, global model, machine learning, (17 more...)

2508.15541

Country:

Asia (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Information Processing SystemsAug-19-2025, 16:07:12 GMT

A More related works

In this section, we discuss more related works in addition to those in Section 2. In this section, we provide more details on our experimental settings, in addition to those in Section 4.1. Below we describe other detailed settings of each defense method. Normal training (i.e., "No defense") On CIFAR10 and GTSRB, we train for I-BAU The original I-BAU paper conducted experiments on a relatively small convolutional network. In this section, we provide more experimental results in addition to those in Section 4. C.1 Potential adaptive attack The results are shown in Table 8. Alongside ASR and CA, we also show the mean square error (MSE) of the image reconstruction. Smaller MSE roughly indicates better image reconstruction quality.

artificial intelligence, dataset, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)